Scraping and Cleaning Economic Data

Inter-American Development Bank Workshop

Diego A. Guerrero

2025-11-23

Learning Objectives

  • Retrieve traditional and non-traditional data from web sources.
  • Download and analyze satellite imagery and Google search trends.
  • Clean data using standard economic adjustments: deflation, seasonal adjustment, and growth rates.
  • Estimate a simple Ordinary Least Squares (OLS) model on macroeconomic data.

Scraping Economic Data

Accessing data on the web

Application Programming Interfaces (APIs)

APIs usually come with thorough documentation describing how to query them. An API key is typically required.

FRED API

Some data sources have dedicated Python libraries. fredapi is a good example; it can be installed with pip install fredapi (or from conda-forge).

import pandas as pd
import matplotlib.pyplot as plt
from fredapi import Fred

fred = Fred(api_key='6082258bcc5fa73a032d9d60c890f744')
data = fred.get_series('SP500')
data = pd.DataFrame(data)
data.plot()

Multiple queries

import requests
import pandas as pd
series = ["SP500", "DJIA", "NASDAQCOM"]
api_key = "6082258bcc5fa73a032d9d60c890f744"

df = pd.DataFrame()
for serie in series:

    url = f"https://api.stlouisfed.org/fred/series/observations?series_id={serie}&api_key={api_key}&file_type=json"

    response = requests.get(url)
    data = response.json()

    dfi = pd.DataFrame(data['observations'])
    dfi = dfi[['value', 'date']]

    dfi.set_index('date', inplace=True)
    dfi.index = pd.to_datetime(dfi.index)

    dfi.rename(columns={"value": f"fred_{serie}"}, inplace=True)
    dfi[f"fred_{serie}"] = pd.to_numeric(dfi[f"fred_{serie}"], errors='coerce')

    df = df.merge(dfi, left_index=True, right_index=True, how='outer')

# Resample and sort once, after all series are merged
df = df.resample("D").mean()
df.sort_index(inplace=True)
  
df.tail(5)
fred_SP500 fred_DJIA fred_NASDAQCOM
date
2025-10-31 6840.20 47562.87 23724.96
2025-11-01 NaN NaN NaN
2025-11-02 NaN NaN NaN
2025-11-03 6851.97 47336.68 NaN
2025-11-04 6771.55 47085.24 NaN

Direct web requests

When no API is available, data can still be retrieved from static web resources. Files are downloaded using HTTP requests.

Formats: .csv, .xlsx, .txt, .zip, .pdf

pandas can often read a URL directly, but sometimes an explicit HTTP request is needed.

Directly with pandas:
import pandas as pd
url = "https://example.com/data.csv"
df = pd.read_csv(url)
With an explicit HTTP request:
import requests
import pandas as pd
from io import StringIO

# Step 1: Download the CSV file via HTTP request
url = "https://example.com/data.csv"
response = requests.get(url)

# Step 2: Check that the request was successful
if response.status_code == 200:
    # Step 3: Convert content into DataFrame
    csv_data = StringIO(response.text)
    df = pd.read_csv(csv_data)
    
    print("Download complete. Data preview:")
    print(df.head())
else:
    print(f"Failed to download file. Status code: {response.status_code}")

Advanced Requests and Headers

Most websites block requests that do not appear to come from a browser. We can use request options to send header information.

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537'}

url = "https://www.bahamas.gov.bs"
r = requests.get(url=url, verify=False, headers=hdr).content

headers: the website is told that the request comes from a Mozilla/Chrome browser

verify: set to False to skip SSL certificate verification (use with caution)

Request output with Beautiful Soup

Most websites are written in HTML. Beautiful Soup is a parser that reads HTML tags.

from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537'}

url = "https://www.bahamas.gov.bs"
r = requests.get(url=url, verify=False, headers=hdr).content
soup = BeautifulSoup(r, "html.parser")

The website is now stored in the variable soup.

We can list all html <a> tags (links) using:

links = [ link.get("href") for link in soup('a') if link.get("href") is not None ]

Opening the file

We can use the trick above to list all .csv or .xlsx files in a website:

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537'}

url = "https://www.bahamas.gov.bs"
r = requests.get(url=url, verify=False, headers=hdr).content
soup = BeautifulSoup(r, "html.parser")
links = [ link.get("href") for link in soup('a') if link.get("href") is not None and "csv" in link.get('href') ]

The list links will have the URL of all csv files. Now we can open the file:

from io import BytesIO

get = links[0]
response = requests.get(get, verify=False, headers=hdr)
df = pd.read_csv(BytesIO(response.content))

Downloading

Suppose we have stored a file’s url in the variable “get”. Many times we can open this file directly:

response = requests.get(get, stream=True)

Sometimes, however, we need to download the file:

response = requests.get(get, verify=False, timeout=300, headers=hdr)
file_name = "MyFile.pdf"
file_name = f"{folder_spec}/{file_name}"
with open(file_name, 'wb') as file:
    file.write(response.content)
print(f"Downloaded {file_name}")

Managing PDFs

PDFs are among the most complicated formats to clean.

PyPDF2 reads the PDF text and makes it easier to find the right page.

import tabula, camelot, PyPDF2

def extract_page_text(pdf_path):
    '''
    Open a PDF and read the text of every page. Used to search for specific pages.
    '''
    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        page_texts = [page.extract_text() for page in reader.pages]
    return page_texts

camelot and tabula specialize in extracting tables from PDFs.

The pdf process

import tabula, camelot, PyPDF2, requests
import pandas as pd

response = requests.get(get, stream=True)

file_name = f"{folder_spec}/gdp_file.pdf"
with open( file_name , "wb") as pdf_file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            pdf_file.write(chunk)

# Find the page number
page_texts = extract_page_text(file_name)
table_2_pages = [i + 1 for i, text in enumerate(page_texts) if "TABLE 2" in text ]

# Iterate over all pages with the word TABLE 2 and extract them as tables inside a list "tables"
if table_2_pages:
    tables = []
    for page in table_2_pages:
        page_tables = camelot.read_pdf(file_name, pages=str(page))
        tables.extend(page_tables)

    df0 = pd.DataFrame()
    for x in range(len(tables)):
        # Open the table
        df = tables[x].df
        # Cleaning steps go here

        df0 = pd.concat([df0, df], axis=0)

The pdf process: step-by-step

Opening the file

import camelot, PyPDF2
import pandas as pd

def extract_page_text(pdf_path):
  with open(pdf_path, 'rb') as pdf_file:
      reader = PyPDF2.PdfReader(pdf_file)
      page_texts = [page.extract_text() for page in reader.pages]
  return page_texts

file_name = "session_2_files/gdp_file.pdf"

# Find the page number
page_texts = extract_page_text(file_name)
table_2_pages = [i + 1 for i, text in enumerate(page_texts) if "TABLE 2" in text ]
print(table_2_pages)
[6, 7, 8]

Data Cleaning

Working dataset

National Budget 10 Year Monthly Data (xlsx)

Open the file with pandas

import pandas as pd

file = "session_2_files/MonthlyData.xlsx"
df = pd.read_excel(file)
df
Unnamed: 0 JUL 2012 AUG 2012 SEP 2012 OCT 2012 NOV 2012 DEC 2012 JAN 2013 FEB 2013 MAR 2013 ... MAY 2024 JUN 2024 JUL 2024 AUG 2024 SEP 2024 OCT 2024 NOV 2024 DEC 2024 JAN 2025 FEB 2025
0 TAX REVENUE (a+b+c+d) 93040348 86483176 78642733 82883677 92179630 133238356 114824826 96878244 104590342 ... 241613741 204402400 255231585 193503688 167475359 236023998 230465826 208805894 278995125 241149985
1 a. TAXES ON PROPERTY 5365152 3671873 4019217 4465873 14157876 30151541 11907058 5451162 6462140 ... 21913491 6984546 9623902 6195951 9551582 7973590 21072882 14207695 20146131 23806222
2 b. TAXES ON GOODS & SEERVICES 37405548 39184491 35059983 36258228 36706457 39313639 58119469 51401840 55054774 ... 141071438 127777154 183129726 117635744 102715248 161040855 124113322 121700941 189330813 165977687
3 i. GENERAL TAXES ON GOODS & SERVICES 31597088 32422854 30854576 31307502 29064451 29947164 37444634 36096632 26008730 ... 116292674 110258234 168021505 111749001 95408539 142005686 105226938 107198452 165107832 112790078
4 VALUE ADDED TAX 0 0 0 0 0 0 0 0 0 ... 107698722 99208190 154091380 100027953 85266170 132555478 95441872 95695606 152428688 103161582
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
83 ACQUISITION OF FIXED ASSETS 9927149 21651950 21174562 5472129 10873272 27465690 20477064 19060318 11022186 ... 16851928 19112442 37315394 36110379 30875113 21811320 25377276 19164010 13350340 9418346
84 LAND 35000 610336 25929 30100 300000 0 263148 151352 0 ... 0 0 0 0 0 188341 18356 684710 135000 1090526
85 TOTAL EXPENDITURE 133417164 155961204 160357472 144566394 149251682 202644733 172360192 136265908 154900104 ... 261681048 270588087 335917578 279538178 252228611 344528344 335545120 291319766 305855273 234356200
86 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
87 SURPLUS/DEFICIT -25056364 -57493129 -67011839 -47832158 -30039299 -44872480 -47925435 -34056397 -34326090 ... 25793212 -35625802 -59079363 -61498551 -64538790 -88404828 -82321717 -42208300 -3793834 58580747

88 rows × 153 columns

Clean up column names

df.columns = df.columns.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower()
for col in df.columns :
    if "unnamed: 0" in col : df.rename(columns = {col : 'variable'}, inplace=True)
df.variable = df.variable.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.lower().str.strip()
df = df.dropna(subset='variable')
df
variable jul 2012 aug 2012 sep 2012 oct 2012 nov 2012 dec 2012 jan 2013 feb 2013 mar 2013 ... may 2024 jun 2024 jul 2024 aug 2024 sep 2024 oct 2024 nov 2024 dec 2024 jan 2025 feb 2025
0 tax revenue (a+b+c+d) 93040348 86483176 78642733 82883677 92179630 133238356 114824826 96878244 104590342 ... 241613741 204402400 255231585 193503688 167475359 236023998 230465826 208805894 278995125 241149985
1 a. taxes on property 5365152 3671873 4019217 4465873 14157876 30151541 11907058 5451162 6462140 ... 21913491 6984546 9623902 6195951 9551582 7973590 21072882 14207695 20146131 23806222
2 b. taxes on goods & seervices 37405548 39184491 35059983 36258228 36706457 39313639 58119469 51401840 55054774 ... 141071438 127777154 183129726 117635744 102715248 161040855 124113322 121700941 189330813 165977687
3 i. general taxes on goods & services 31597088 32422854 30854576 31307502 29064451 29947164 37444634 36096632 26008730 ... 116292674 110258234 168021505 111749001 95408539 142005686 105226938 107198452 165107832 112790078
4 value added tax 0 0 0 0 0 0 0 0 0 ... 107698722 99208190 154091380 100027953 85266170 132555478 95441872 95695606 152428688 103161582
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82 j. acq of non-financial assets 9962149 22262286 21200491 5502229 11173272 27465690 20740212 19211670 11022186 ... 16851928 19112442 37315394 36110379 30875113 21999661 25395632 19848720 13485340 10508872
83 acquisition of fixed assets 9927149 21651950 21174562 5472129 10873272 27465690 20477064 19060318 11022186 ... 16851928 19112442 37315394 36110379 30875113 21811320 25377276 19164010 13350340 9418346
84 land 35000 610336 25929 30100 300000 0 263148 151352 0 ... 0 0 0 0 0 188341 18356 684710 135000 1090526
85 total expenditure 133417164 155961204 160357472 144566394 149251682 202644733 172360192 136265908 154900104 ... 261681048 270588087 335917578 279538178 252228611 344528344 335545120 291319766 305855273 234356200
87 surplus/deficit -25056364 -57493129 -67011839 -47832158 -30039299 -44872480 -47925435 -34056397 -34326090 ... 25793212 -35625802 -59079363 -61498551 -64538790 -88404828 -82321717 -42208300 -3793834 58580747

83 rows × 153 columns

Select variables

df = df[
    df['variable'].str.startswith("tax revenue")
].copy()
# Add "fiscal" to all variable names
df['variable'] = 'fiscal ' + df.variable

# clean variable name
df['variable'] = (
    df['variable']
    .str.replace(r"\s*\(.*?\)", "", regex=True)   # remove parentheses + contents
    .str.replace(r"\s+", " ", regex=True)         # collapse multiple spaces
    .str.strip()                                  # trim
    .str.replace(" ", "_")                        # change spaces to _
)
df
variable jul 2012 aug 2012 sep 2012 oct 2012 nov 2012 dec 2012 jan 2013 feb 2013 mar 2013 ... may 2024 jun 2024 jul 2024 aug 2024 sep 2024 oct 2024 nov 2024 dec 2024 jan 2025 feb 2025
0 fiscal_tax_revenue 93040348 86483176 78642733 82883677 92179630 133238356 114824826 96878244 104590342 ... 241613741 204402400 255231585 193503688 167475359 236023998 230465826 208805894 278995125 241149985

1 rows × 153 columns

Reshape in action

import numpy as np
df = pd.melt(df, id_vars="variable", var_name="date", value_name='value')
df.date = pd.to_datetime(df.date, errors='coerce')
df.set_index('date', inplace=True)
df.loc[df['value']==0, 'value'] = np.nan
df = df.dropna()
df
variable value
date
2012-07-01 fiscal_tax_revenue 93040348
2012-08-01 fiscal_tax_revenue 86483176
2012-09-01 fiscal_tax_revenue 78642733
2012-10-01 fiscal_tax_revenue 82883677
2012-11-01 fiscal_tax_revenue 92179630
... ... ...
2024-10-01 fiscal_tax_revenue 236023998
2024-11-01 fiscal_tax_revenue 230465826
2024-12-01 fiscal_tax_revenue 208805894
2025-01-01 fiscal_tax_revenue 278995125
2025-02-01 fiscal_tax_revenue 241149985

152 rows × 2 columns

Pivot and Save

If we had multiple variables, we would want them as columns.

df = df.pivot(columns='variable', values='value')
df.to_csv(f"session_2_files/clean_file.csv")
df
variable fiscal_tax_revenue
date
2012-07-01 93040348
2012-08-01 86483176
2012-09-01 78642733
2012-10-01 82883677
2012-11-01 92179630
... ...
2024-10-01 236023998
2024-11-01 230465826
2024-12-01 208805894
2025-01-01 278995125
2025-02-01 241149985

152 rows × 1 columns

Let’s visualize

import matplotlib.pyplot as plt
df.plot()
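The learning objectives mention deflation, seasonal adjustment, and growth rates. A minimal sketch of these adjustments with pandas, on a synthetic series (the revenue numbers and the price index below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic monthly revenue series (invented numbers, for illustration only)
idx = pd.date_range("2012-07-01", periods=36, freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame({"fiscal_tax_revenue": 100 + np.arange(36) + 10 * rng.random(36)},
                  index=idx)

# Deflation: divide by a (hypothetical) price index, base = 100
df["cpi"] = np.linspace(100, 110, 36)
df["real_revenue"] = df["fiscal_tax_revenue"] / df["cpi"] * 100

# Growth rate: year-over-year percent change
df["yoy_growth"] = df["real_revenue"].pct_change(periods=12) * 100

# Crude seasonal adjustment: 12-month centered moving average
df["trend"] = df["real_revenue"].rolling(12, center=True).mean()
```

The moving average is only a rough seasonal filter; dedicated tools (e.g., X-13ARIMA-SEATS via statsmodels) handle seasonality properly.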

Satellite Data

Nighttime Lights

Nighttime satellite imagery provides a valuable proxy for economic activity, electrification, and urban growth — particularly in areas with limited or unreliable statistical data.

It measures the intensity of light emissions captured by satellites orbiting the Earth.
Data are typically available at fine spatial and temporal resolutions, making them a powerful complement to conventional economic indicators.

Why It’s Complex

Working with satellite data, especially NASA's Nighttime Lights (VIIRS) data, involves multiple steps and tools.

  • This is one of the most complex code blocks in our workflow, combining geospatial processing, data scraping, and time-series construction.

Workflow Overview

  1. Identify target locations
    • Define the geographic boundaries.
    • Create a shapefile or bounding box for the selected areas.
  2. Download raw .h5 map files
    • Retrieve satellite imagery files directly from NASA’s repository.
    • Often requires authentication and automated scraping (via requests or NASA’s API).
    • Each file corresponds to a specific day/month.
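Step 2 can be sketched as below. The bearer-token header follows the convention NASA's LAADS archive uses for authenticated downloads, but the URL and token are placeholders and this function is an untested illustration:

```python
import requests

def auth_headers(token):
    # Bearer-token header (the token value itself is a placeholder)
    return {"Authorization": f"Bearer {token}"}

def download_h5(url, token, out_path):
    # Stream one .h5 tile to disk using an authenticated request
    response = requests.get(url, headers=auth_headers(token), stream=True, timeout=300)
    response.raise_for_status()
    with open(out_path, "wb") as fh:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                fh.write(chunk)

# Hypothetical usage (URL and token are not real):
# download_h5("https://ladsweb.modaps.eosdis.nasa.gov/.../VNP46A2.h5",
#             token="MY_TOKEN", out_path="tile.h5")
```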

Workflow Overview (continued)

  3. Open and read the .h5 files
    • Use libraries like h5py or GDAL to extract pixel values.
    • Convert the raster data into usable arrays or DataFrames.
    • Aggregate data over the selected geographic units (e.g., county or grid cell).
  4. Clean and prepare the series
    • Handle missing pixels and noise (cloud cover, stray light, etc.).
    • Average or sum pixel intensities to build a monthly or yearly time series.
    • Save a tidy dataset ready for merging with socioeconomic indicators.
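Steps 3 and 4 can be sketched with a NumPy array standing in for the raster; in practice the array would come from the .h5 file via h5py (the dataset path depends on the VIIRS product, so it appears only as a comment). The aggregation logic is the same either way:

```python
import numpy as np

# Synthetic 10x10 nightlight raster (radiance values are invented);
# with h5py this array would be loaded with something like:
#   with h5py.File("tile.h5") as f:
#       arr = f[dataset_name][...]   # dataset_name depends on the product
arr = np.arange(100, dtype=float).reshape(10, 10)
arr[0, 0] = np.nan  # simulate a missing pixel (e.g., cloud cover)

# Bounding box of the target area, in pixel coordinates (rows 2-5, cols 3-7)
region = arr[2:6, 3:8]

# Aggregate: mean and total light intensity, ignoring missing pixels
mean_intensity = np.nanmean(region)
total_intensity = np.nansum(region)
```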

First Estimation
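As a placeholder for the estimation exercise, a minimal OLS fit on synthetic data (the variables and coefficients are invented; on real series one would typically use a library such as statsmodels, which also reports standard errors):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic "macro" data: revenue growth driven by GDP growth
gdp_growth = rng.normal(2.0, 1.0, n)
revenue_growth = 1.0 + 1.5 * gdp_growth + rng.normal(0.0, 0.5, n)

# OLS via least squares on [1, x]
X = np.column_stack([np.ones(n), gdp_growth])
beta, *_ = np.linalg.lstsq(X, revenue_growth, rcond=None)

intercept, slope = beta  # estimates should be close to 1.0 and 1.5
```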

Activity

Appendix

Camelot

Running camelot may require some additional libraries:

conda install -c conda-forge opencv --yes
pip install opencv-contrib-python